Error correction translation using text corpora

Authors

  • Kai Ishikawa
  • Eiichiro Sumita
Abstract

In this paper, we propose an error correction method using text corpora. In this method, recognition errors are corrected using phonetically similar examples in the text corpora. The reliability of each correction hypothesis is judged according to its semantic consistency and its phonetic similarity to the original input. We previously proposed an error correction method that uses a treebank [1]. However, the previous method was not flexible in its use of examples, because recognition errors caused structural mismatches between the input and the examples. In our new proposal, examples are treated as morpheme sequences. This enables us to use examples partially when there are no useful full-sentence examples. We built our proposed method into a speech translation system and compared the translation quality of simple translation with that of translation with error correction. The rate of acceptable translations increased by about 10% with our proposed method compared to simple translation.
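
The idea of reusing part of an example, rather than a whole sentence, can be illustrated with a minimal sketch. It is not the authors' implementation: plain Levenshtein distance over toy romanized phoneme strings stands in for the paper's phonetic similarity measure, the semantic consistency check is omitted, and every name here (`edit_distance`, `phonetic_similarity`, `best_span`, `correct`, the `threshold` value) is hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def phonetic_similarity(a, b):
    """Similarity in [0, 1]; 1.0 means identical phoneme strings (assumed measure)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


def best_span(input_phonemes, example):
    """Find the contiguous morpheme span of one example closest to the input.

    `example` is a list of (surface, phonemes) pairs, i.e. a morpheme sequence.
    """
    best_score, best = 0.0, []
    for start in range(len(example)):
        for end in range(start + 1, len(example) + 1):
            span = example[start:end]
            candidate = "".join(phonemes for _, phonemes in span)
            score = phonetic_similarity(input_phonemes, candidate)
            if score > best_score:
                best_score, best = score, span
    return best_score, best


def correct(input_phonemes, corpus, threshold=0.7):
    """Return (surface morphemes, score) for the best span, or (None, score)
    when no span in the corpus is similar enough (threshold is an assumption)."""
    best_score, best = 0.0, []
    for example in corpus:
        score, span = best_span(input_phonemes, example)
        if score > best_score:
            best_score, best = score, span
    if best_score >= threshold:
        return [surface for surface, _ in best], best_score
    return None, best_score


# Toy corpus with one example sentence; the phoneme strings are made up.
corpus = [[("where", "wer"), ("is", "iz"), ("the", "dhe"), ("station", "steishon")]]
words, score = correct("izdhesteishan", corpus)
print(words, round(score, 3))
```

Because each example is a flat morpheme sequence, a misrecognized fragment can be matched against any contiguous part of it, with no need for the input and the example to share a parse structure.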

Similar resources

The Effect of Learner Corpus Size in Grammatical Error Correction of ESL Writings

English as a Second Language (ESL) learners’ writings contain various grammatical errors. Previous research on automatic error correction for ESL learners’ grammatical errors deals with restricted types of learners’ errors. Some types of errors can be corrected by rules using heuristics, while others are difficult to correct without statistical models using native corpora and/or learner corpora...

Constrained Grammatical Error Correction using Statistical Machine Translation

This paper describes our use of phrase-based statistical machine translation (PBSMT) for the automatic correction of errors in learner text in our submission to the CoNLL 2013 Shared Task on Grammatical Error Correction. Since the limited training data provided for the task was insufficient for training an effective SMT system, we also explored alternative ways of generating pairs of incorrect a...

A new model for Persian multi-part words edition based on statistical machine translation

Multi-part words in English are hyphenated, with a hyphen separating the parts. Persian has multi-part words as well. In Persian morphology, a half-space character is needed to separate the parts of multi-part words, but in many cases people incorrectly use a space character instead. This common incorrect use of the space leads to some s...

Extracting parallel corpora from comparable documents to improve translation quality in machine translation systems

Data used for training statistical machine translation methods are usually prepared from three resources: parallel, non-parallel and comparable text corpora. Parallel corpora are an ideal resource for translation, but because such texts are scarce, non-parallel and comparable corpora are also used for parallel text extraction. Most existing methods for exploiting comparable corpora loo...

NAIST at 2013 CoNLL Grammatical Error Correction Shared Task

This paper describes the Nara Institute of Science and Technology (NAIST) error correction system in the CoNLL 2013 Shared Task. We constructed three systems: a system based on the Treelet Language Model for verb form and subject-verb agreement errors; a classifier trained on both learner and native corpora for noun number errors; a statistical machine translation (SMT)-based model for prepositi...

Publication date: 1999